getwd()
## [1] "/Users/TARDIS/Documents/STUDIES/context_word_seg"
library(ProjectTemplate)
load.project()
transcript_lengths <- df %>%
as.tbl() %>%
extract(col=utt, into=c("child", "age", "utt"), regex="([[:alpha:]]{2})([[:digit:]]+).*_([[:digit:]]+)$", convert = TRUE) %>%
group_by(child, age) %>%
summarize(N.utts=max(utt))
ggplot(transcript_lengths, aes(y=N.utts, x=age, fill = child)) +
geom_bar(stat = "identity") +
scale_x_continuous(breaks=6:16, minor_breaks = NULL) +
labs(x="Age (weeks)", y="Number of utterances", title = "Transcript lengths by age")
ggsave(filename = "transcript_lengths_1.png", path = "graphs/descriptives", width = 5, height = 4, units="in")
ggplot(transcript_lengths, aes(y=N.utts, x=child)) +
geom_boxplot() +
geom_point(aes(color=child)) +
labs(x="Child", y="Number of utterances", title = "Transcript lengths by child")
ggsave(filename = "transcript_lengths_2.png", path = "graphs/descriptives", width = 5, height = 4, units="in")
Sequence plots: Show the sequence of coded contexts over time for each transcript.
source("src/sequence_plots.R")
In each of the following plots, the activity contexts are shown on the y-axis, and the course of the transcript runs along the x-axis, from the first utterance to the last utterance. Darkening in the colored bar for each context indicates that that context is coded as happening at that utterance. Utterance number is a rough proxy for time of day since time stamps are not available in the corpus; while there are obvious shortcomings with this noisy measure, it does preserve the sequential ordering of events even if it loses information about precise timing.
There are several general methodological points of interest visible in these plots. When defining ‘activity context’ using the word list approach, there can be (and often are) sections of the corpus that receive no context tag at all (vertical sections in the plot where there are no darkened context bars). This is less common with the other approaches to defining context, giving the word list plots a relatively sparse appearance. In the coder judgment plots, there are some vertical bars with no data at all; those represent the few parts of the corpus that are not fully coded (utterances that do not have 5 independent coders). In order to determine context using topic modeling, I binned the transcripts into documents of 30 utterances each (necessary to allow large enough samples of speech to estimate the word co-occurrence rates on which topic modeling algorithms are based). This means the topic modeling approach tags the corpus in bins of 30 utterances, resulting in chunks of each context that are at least 30 utterances long. The word list method identifies chunks of at least 5 utterances due to the smoothing procedure that picks up 2 utterances before and after each tagged utterance. The coder judgments are based on sections of the corpus that were 30 utterances long, but because each utterance is often coded by more than the required 5 coders, taking a random sample of 5 coders for each utterance results in a more natural gradient at the edges of context chunks.
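The binning and smoothing steps described above can be sketched in base R. This is a minimal illustration, not the code used for the analyses: `bin_utterances()` and `smooth_tags()` are hypothetical helper names, and the handling of a short final bin and of hits near the start or end of a transcript is an assumption.

```r
# Hypothetical helper: assign each utterance number to a document (bin) of
# `size` consecutive utterances, as in the topic modeling step.
bin_utterances <- function(utt, size = 30) {
  (utt - 1) %/% size + 1
}

# Hypothetical helper: extend each tagged utterance to its `k` neighbors on
# either side, so an isolated hit becomes a chunk of 2 * k + 1 utterances.
# Assumption: windows are simply truncated at the edges of the transcript.
smooth_tags <- function(tagged, k = 2) {
  n <- length(tagged)
  smoothed <- logical(n)
  for (i in which(tagged)) {
    window <- max(1, i - k):min(n, i + k)
    smoothed[window] <- TRUE
  }
  smoothed
}

# A toy transcript of 65 utterances falls into three bins.
bins <- bin_utterances(1:65)
table(bins)

# A single key word hit at utterance 6 in a toy transcript of 12 utterances.
tagged <- rep(FALSE, 12)
tagged[6] <- TRUE
which(smooth_tags(tagged))
```

With `k = 2`, the single hit at utterance 6 is extended to utterances 4 through 8, which is where the minimum chunk length of 5 comes from.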
Note that contexts with fewer than 100 total utterances across the entire corpus (less than 1% of the corpus) are not shown. This omits 1 context using the word list definition of context and 13 contexts using the coder judgments.
The following are four sequence plots from one transcript (child la at 16 weeks, the longest transcript in the corpus).
Defining context using key words from the word lists:
Defining context using coder judgments:
Defining context using topics from LDA topic modeling:
Defining context using topics from STM topic modeling (allows for variability family to family when discovering topics):
It is also possible to see hints of agreement across methods by examining the plots. For example, there is a section around utterance 1300 that appears to be identified as “bathtime” by human coders, includes words from the “bath” word list, and is mostly topic 8 by the STM topic modeling (that episode is not as clearly picked out by the LDA topic modeling, although perhaps topics 10 and/or 1 correspond). Note that the LDA topic modeling identifies mostly one context for the duration of the transcript (topic 11), possibly getting caught on family-specific words like the child’s name.
The following are four sequence plots from one transcript (child gl at 6 weeks). There appear to be a couple of naps during this transcript: one around utterance 150 and another beginning around utterance 400 (and possibly another right at the beginning of the transcript, as identified by the word list and coder judgment methods only). There also appears to be a bath around utterance 550, according to the word list and coder judgment plots, although it is not marked with topic 8 in the STM plot, unlike in the previous example. Again, the LDA topic modeling uses one topic heavily throughout the transcript (topic 7).
Defining context using key words from the word lists:
Defining context using coder judgments:
Defining context using topics from LDA topic modeling:
Defining context using topics from STM topic modeling (allows for variability family to family when discovering topics):
df_all_long <- df_all %>%
gather(key = "key", value = "value", -utt, -orth, -phon) %>%
extract(col = key, into = c("method", "context"), regex = "(^[[:upper:]]{2,3})_(.*)$", remove = FALSE) %>%
extract(col = utt, into = c("child", "age", "utt"), regex = "(^[[:alpha:]]{2})([[:digit:]]+)[.]cha_([[:digit:]]+)$", convert = TRUE)
min.utts <- 100
contexts_keep <- df_all_long %>%
group_by(key) %>%
summarize(N.utts = sum(value, na.rm = TRUE)) %>%
filter(N.utts > min.utts)
df_keep_long <- df_all_long %>%
dplyr::filter(key %in% contexts_keep$key) %>%
dplyr::select(-key)
contexts_by_transcript <- df_keep_long %>%
group_by(child, age, method, context) %>%
summarize(N.utts = sum(value, na.rm = TRUE))
for(m in unique(contexts_by_transcript$method)){
plot.data <- contexts_by_transcript %>%
dplyr::filter(method == m) %>%
dplyr::filter(N.utts > 0)
n.levels <- length(unique(plot.data$context))
colors <- c(brewer.pal(9,"Greens")[c(4,6,9)],
"#FFFF33",
brewer.pal(9,"YlOrRd")[c(4,6,9)],
brewer.pal(9,"YlGnBu")[c(4,6,8)],
brewer.pal(9,"PuRd")[c(5,7)],
brewer.pal(9,"Purples")[c(5,7,9)])
p1 <- ggplot(plot.data, aes(y = N.utts, x = as.factor(age), fill = context)) +
geom_bar(stat = "identity") +
facet_wrap(~ child , scales = "free") +
scale_fill_manual(values=colors) +
labs(y = "Number of utterances", x="Age (weeks)", title= paste0("Context distribution across transcripts\n", m))
ggsave(plot = p1, filename = paste0("contexts_by_transcripts_", m, ".png"), path = "graphs/descriptives", width = 14, height = 10, units="in")
p2 <- ggplot(plot.data, aes(y = N.utts, x = as.factor(age), fill = context)) +
geom_bar(stat = "identity", position = "fill") +
facet_wrap(~ child , scales = "free") +
scale_fill_discrete(breaks = sort(levels(plot.data$context))) +
labs(y = "Proportion of utterance codes", x="Age (weeks)", title= paste0("Context distribution across transcripts\n", m))
ggsave(plot = p2, filename = paste0("contexts_by_transcripts_", m, "_prop.png"), path = "graphs/descriptives", width = 14, height = 10, units="in")
}
These plots show how many utterances fall in each context for each transcript (each child at each age), for each of the approaches to defining context. For each approach, there is one plot showing the raw number of utterances in each transcript and a second plot presenting the same information as a proportion of the total context codes for each transcript.
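As a small illustration of how the proportion plots relate to the raw counts, the sketch below computes the within-transcript proportions by hand. The data frame `toy` and its values are invented for the example; only the column names follow the code above.

```r
# Toy counts for a single transcript; the child code, context names, and
# numbers are invented for illustration.
toy <- data.frame(
  child   = "aa",
  age     = 10,
  context = c("bath", "feeding", "play"),
  N.utts  = c(50, 150, 300)
)

# Divide each count by the total number of context codes in its transcript
# (child x age). This is the quantity that position = "fill" displays as the
# height of each segment within a bar.
toy$prop <- toy$N.utts / ave(toy$N.utts, toy$child, toy$age, FUN = sum)
toy
```

If an utterance can carry more than one context code, the denominator here is the number of codes in the transcript rather than the number of utterances, so the proportions describe the mix of codes rather than coverage of the transcript.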
One of the most important things to note in these plots is which contexts tend to be distributed more or less evenly across families and ages, and which contexts appear to be family- or age-specific. There is a particularly striking difference between the two topic modeling methods, LDA and STM, with the latter showing a rather even distribution of contexts whereas the former appears to have a strong tendency to pick one or two dominant topics for each family. This may be due to the fact that STM (unlike LDA) allows for variability family to family in the prevalence and characteristics of each topic during estimation. Because LDA lacks this flexibility, it may get caught up on common words that are family-specific and, as a consequence, miss patterns of words that vary within families. The most obvious examples of family-specific words are the children’s names, but there are several other words that, for one reason or another, appear often in only one family and not the others.
Defining context using key words from the word lists:
Defining context using coder judgments:
Defining context using topics from LDA topic modeling:
Defining context using topics from STM topic modeling (allows for variability family to family when discovering topics):
top_words_lda <- top.topic.words(lda$topics, num.words=20) %>%
as.data.frame(stringsAsFactors = FALSE)
colnames(top_words_lda) <- paste0("topic_", 1:ncol(top_words_lda))
top_words_lda %>%
kable(caption="Top words for each topic, in order")
| topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 | topic_9 | topic_10 | topic_11 | topic_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| oh | hey | go | yes | hello | yes | oh | ya | oh | oh | mm | oh |
| hey | oh | hey | oh | hmm | oh | come | go | go | yes | mummi | yes |
| dear | mummi | want | well | oh | tell | dear | come | come | ah | yes | tickl |
| hannah | yes | can | boo | dear | go | yes | oh | yes | dear | eh | dear |
| alright | hannah | hmm | hello | got | hey | tell | littl | ya | one | girl | go |
| trea | hello | come | go | gillian | come | good | got | eh | clean | hello | come |
| ssh | look | nice | say | matter | dear | got | yes | bath | two | come | got |
| yum | hold | see | come | smile | look | darl | look | alright | wee | oh | big |
| darl | smile | smile | dear | look | stori | girl | like | get | dirti | good | see |
| like | shh | littl | clever | hey | daddi | mum | want | like | now | alright | good |
| stretch | girl | look | christoph | want | got | windi | bit | want | pet | go | thumb |
| thank | hmm | like | hey | littl | can | mummi | make | mum | dri | hey | hey |
| chang | chou | play | ya | go | get | nappi | see | good | hair | big | oop |
| cri | love | hold | know | mummi | nois | hey | eh | arm | wash | know | fat |
| nois | bubbl | now | boy | face | funni | stori | nice | splash | away | darl | one |
| now | dear | take | good | like | teddi | just | just | back | get | nice | toe |
| pie | lambchop | hand | hand | ah | make | put | shall | now | got | girli | bad |
| beebo | nice | oh | can | big | smile | minut | get | big | hide | hannah | finger |
| dub | blow | well | wrong | funni | nose | chang | one | kick | nice | better | back |
| girl | littl | kick | look | girl | round | pet | big | gonna | anoth | daddi | get |
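To make the ordering in the table above concrete, here is a base R sketch of ranking words within each topic. It assumes the words are ranked by their count within each topic; `top_words()` is a hypothetical helper and the toy matrix is invented, so this is not the `lda` package's implementation.

```r
# Toy topic-by-word count matrix (rows are topics, columns are words); the
# counts are invented for illustration.
toy_topics <- matrix(
  c(5, 1, 9, 2,
    0, 7, 3, 8),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("bath", "nappi", "oh", "stori"))
)

# Hypothetical helper: for each topic, return the n highest-count words in
# decreasing order, one column per topic (matching the table layout).
top_words <- function(topics, n) {
  apply(topics, 1, function(counts) {
    colnames(topics)[order(counts, decreasing = TRUE)[seq_len(n)]]
  })
}

top_words(toy_topics, n = 2)
```

Each column of the result corresponds to one topic, read from the most frequent word downward.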